
Use generation in pod disruption budget #36438

Merged (2 commits into kubernetes:master, Nov 9, 2016)

Conversation

@mwielgus (Contributor, author) commented Nov 8, 2016

Fixes #35324

Previously it was possible to use allowedDisruptions calculated for the previous spec with the current spec. With the generation check, the API server always makes sure that allowedDisruptions was calculated for the current spec.

At the same time I set the registry policy to only accept updates if the version based on which the update was made matches the current version in etcd. That ensures that parallel eviction executions don't use the same allowed disruption.

cc: @davidopp @Kargakis @wojtek-t


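A minimal sketch of the two mechanisms the description combines: the observedGeneration check in the eviction path, and the resource version acting as a compare-and-swap guard so that parallel evictions cannot consume the same budget. The types and names below are simplified stand-ins, not the actual code in this PR.

```go
package main

import "fmt"

// Simplified stand-ins for the real policy API types.
type PDBStatus struct {
	ObservedGeneration    int64
	PodDisruptionsAllowed int32
}

type PDB struct {
	Generation      int64  // bumped by the registry on spec changes
	ResourceVersion string // bumped by etcd on every write
	Status          PDBStatus
}

// checkAndDecrement mirrors the logic added here: refuse to use a budget
// whose status was computed for an older spec, then consume one disruption.
func checkAndDecrement(pdb *PDB) error {
	if pdb.Status.ObservedGeneration != pdb.Generation {
		// Status is stale relative to the spec; the caller has to retry later.
		return fmt.Errorf("pdb status has not been updated after spec change")
	}
	if pdb.Status.PodDisruptionsAllowed <= 0 {
		return fmt.Errorf("no disruptions allowed")
	}
	pdb.Status.PodDisruptionsAllowed--
	// The caller then writes pdb back with its original ResourceVersion; the
	// registry rejects the write if someone else updated the object in the
	// meantime, so two parallel evictions cannot both spend the same budget.
	return nil
}

func main() {
	pdb := &PDB{Generation: 2, Status: PDBStatus{ObservedGeneration: 2, PodDisruptionsAllowed: 1}}
	fmt.Println(checkAndDecrement(pdb)) // <nil>
	fmt.Println(checkAndDecrement(pdb)) // no disruptions allowed
}
```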

@mwielgus added the retest-not-required and release-note-none labels on Nov 8, 2016
@mwielgus mwielgus added this to the v1.5 milestone Nov 8, 2016
@k8s-github-robot added the kind/api-change and size/XXL labels on Nov 8, 2016
// Most recent generation observed when updating this PDB status. PodDisruptionsAllowed and other
// status information is valid only if observedGeneration equals the PDB's object generation.
// +optional
ObservedGeneration *int64 `json:"observedGeneration,omitempty" protobuf:"varint,1,opt,name=observedGeneration"`
Contributor:

Why is this a pointer?

Contributor Author (@mwielgus):

So that there is no accidental generation equality if observedGeneration/generation is not set. I followed the pattern we had with HPA.
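
A small illustration of the concern, assuming plain int64 fields (example code, not from the PR): a status that has never been written reports the zero value for ObservedGeneration, which could accidentally compare equal to Generation, whereas a pointer makes the "never set" case explicit.

```go
package main

import "fmt"

func main() {
	// With a plain int64, a status that was never written reports 0. If the
	// object's Generation were also 0, the check would accidentally pass.
	var observedValue int64 // zero value: 0
	var generation int64    // zero value: 0
	fmt.Println(observedValue == generation) // true, accidentally

	// With a pointer, "never set" is nil and must be handled explicitly,
	// which is the pattern HPA used at the time.
	var observedPtr *int64 // nil until the controller writes status
	fmt.Println(observedPtr != nil && *observedPtr == generation) // false
}
```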

Contributor Author (@mwielgus):

Hmm, on the other hand ReplicaSetStatus and DeploymentStatus use just int...

Contributor:

Yes, there is no need for it to be a pointer.

@@ -61,7 +54,19 @@ type PodDisruptionBudgetStatus struct {
// the list automatically by PodDisruptionBudget controller after some time.
// If everything goes smooth this map should be empty for the most of the time.
// Large number of entries in the map may indicate problems with pod deletions.
- DisruptedPods map[string]unversioned.Time `json:"disruptedPods" protobuf:"bytes,5,rep,name=disruptedPods"`
+ DisruptedPods map[string]unversioned.Time `json:"disruptedPods" protobuf:"bytes,2,rep,name=disruptedPods"`
Member:

You can't change order of fields, as this affects protobuf representation.

Member:

Discussed offline - actually since this wasn't yet released, you probably can do it.

@@ -116,6 +116,10 @@ func (r *EvictionREST) Create(ctx api.Context, obj runtime.Object) (runtime.Obje
}

func (r *EvictionREST) checkAndDecrement(namespace string, podName string, pdb policy.PodDisruptionBudget) (ok bool, err error) {
if pdb.Status.ObservedGeneration == nil || *pdb.Status.ObservedGeneration != pdb.Generation {
return false, fmt.Errorf("pdb status has not been updated after spec change")
Member:

I'm wondering if this should be an error, because this is a situation that will be happening in a real system. It just means that we haven't yet updated the PDB. So I think that we should:

return false, nil

here.

To me, an error is for something that is not expected, and this can definitely happen.

Contributor (@0xmichalis, Nov 8, 2016):

The eviction needs to be retried until observedGeneration matches generation. I see two options: either poll for some time here, or force the request to be retried (by returning a ServerTimeoutError).

Contributor:

Or both.
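
A rough sketch of the second option above, rejecting the request with an error that tells the client to retry. The helper and signature come from the present-day apimachinery errors package and are assumed here for illustration; the resource name and retry interval are made up.

```go
package eviction

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// checkObserved signals "try again later" instead of failing hard when the
// PDB status has not yet caught up with the spec.
func checkObserved(observedGeneration, generation int64) error {
	if observedGeneration != generation {
		// The 2-second Retry-After is illustrative only.
		return apierrors.NewServerTimeout(
			schema.GroupResource{Group: "policy", Resource: "poddisruptionbudgets"},
			"get", 2)
	}
	return nil
}
```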

@@ -116,6 +116,10 @@ func (r *EvictionREST) Create(ctx api.Context, obj runtime.Object) (runtime.Obje
}

func (r *EvictionREST) checkAndDecrement(namespace string, podName string, pdb policy.PodDisruptionBudget) (ok bool, err error) {
if pdb.Status.ObservedGeneration == nil || *pdb.Status.ObservedGeneration != pdb.Generation {
Contributor:

There is nothing that bumps Generation since there is no spec change happening, so it will always be 1 for PDBs. You will either need to bump Generation here when you do the update, or you should probably use a different Generation/ObservedGeneration pair, because Generation is meant to be bumped only on spec changes.

Contributor Author (@mwielgus):

I talked with @wojtek-t and we initially agreed that the optimistic locking implemented in client/etcd would make the extra generation unnecessary. However, he now has some second thoughts about how exactly it works and whether there are retries on version mismatch. We are checking it right now.

Contributor:

I don't understand how this solves the issue with Generation not being incremented by anything atm.

Contributor Author (@mwielgus):

If I understand correctly, each object has a resource version that is incremented on every update. It is possible to allow an update only if the internal etcd version equals the version in the update. With that, no two concurrent updates would be able to use the same budget at the same time: the moment the first one gets in, the version is incremented and the second one fails.
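
A sketch of that compare-and-swap pattern as a client would express it with today's client-go (modern API names and the policy/v1 field DisruptionsAllowed, which differ from the 2016 code; purely illustrative, since in practice the eviction subresource performs this update server-side):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/retry"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Each attempt re-reads the PDB, so UpdateStatus carries the latest
	// resourceVersion. If another writer got in first, the API server returns
	// a Conflict and the closure runs again against fresh data; no two
	// writers can spend the same budget.
	err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
		pdb, err := client.PolicyV1().PodDisruptionBudgets("default").Get(ctx, "my-pdb", metav1.GetOptions{})
		if err != nil {
			return err
		}
		if pdb.Status.DisruptionsAllowed == 0 {
			return fmt.Errorf("no disruptions allowed")
		}
		pdb.Status.DisruptionsAllowed--
		_, err = client.PolicyV1().PodDisruptionBudgets("default").UpdateStatus(ctx, pdb, metav1.UpdateOptions{})
		return err
	})
	if err != nil {
		panic(err)
	}
}
```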

Member:

That said, ObservedGeneration doesn't provide any value now (as @Kargakis observed, because the spec doesn't change), so we should just remove it.

Contributor Author (@mwielgus):

It does. It protects us against unsynced spec changes.

Contributor:

I don't think the unconditionalUpdate change helps because all it does is to allow unconditional updates if the incoming resource version is zero, right?

@wojtek-t (Member) commented Nov 8, 2016

Please don't add "retest-not-required" in such large PRs

@k8s-ci-robot (Contributor):

Jenkins unit/integration failed for commit f8991c1. Full PR test history.

The magic incantation to run this job again is @k8s-bot unit test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@wojtek-t (Member) commented Nov 8, 2016

This generally LGTM - but please fix tests as the failure is related.

@@ -116,6 +116,9 @@ func (r *EvictionREST) Create(ctx api.Context, obj runtime.Object) (runtime.Obje
}

func (r *EvictionREST) checkAndDecrement(namespace string, podName string, pdb policy.PodDisruptionBudget) (ok bool, err error) {
if pdb.Status.ObservedGeneration != pdb.Generation {
Contributor (@0xmichalis, Nov 8, 2016):

Nothing increments Generation, hence this will always be false once the controller observes a PDB for the first time.

Member:

This is a good point. I thought that Generation is automatically incremented by our infrastructure on every Spec update, but it seems this has to be done manually. For example, RC is doing it here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/core/controller/strategy.go#L81

@mwielgus - this should be added.

Contributor:

It's already there for PDB, but I thought the spec is immutable.

newPodDisruptionBudget.Generation = oldPodDisruptionBudget.Generation + 1

Member:

Aah ok, thanks.
Why do you say that spec is immutable? Is there some validation rule that disallows updates?

Contributor:

Why do you say that spec is immutable? Is there some validation rule that disallows updates?

Currently yes:

allErrs = append(allErrs, field.Forbidden(field.NewPath("spec"), "updates to poddisruptionbudget spec are forbidden."))
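
For context on the pattern discussed in this thread: the Generation bump quoted above typically lives in the registry strategy's PrepareForUpdate, guarded so that it only fires when the spec actually changes. A simplified sketch of that pattern follows (stand-in types, not the actual PDB strategy code):

```go
package registry

import "reflect"

// Simplified stand-ins for the real API types.
type PDBSpec struct{ MinAvailable int }

type PDB struct {
	Generation int64
	Spec       PDBSpec
}

// PrepareForUpdate bumps Generation only when the spec changed, so that
// comparing Status.ObservedGeneration against Generation stays meaningful.
func PrepareForUpdate(newPDB, oldPDB *PDB) {
	if !reflect.DeepEqual(newPDB.Spec, oldPDB.Spec) {
		newPDB.Generation = oldPDB.Generation + 1
	}
}
```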

@@ -91,9 +91,10 @@ func (podDisruptionBudgetStrategy) ValidateUpdate(ctx api.Context, obj, old runt
return append(validationErrorList, updateErrorList...)
}

-// AllowUnconditionalUpdate is the default update policy for PodDisruptionBudget objects.
+// AllowUnconditionalUpdate is the default update policy for PodDisruptionBudget objects. Status update should
Contributor:

AllowUnconditionalUpdate is used only if the resource version in the UPDATE request is zero:

doUnconditionalUpdate := resourceVersion == 0 && e.UpdateStrategy.AllowUnconditionalUpdate()

How is this helping?

Contributor Author (@mwielgus):

newVersion, err := e.Storage.Versioner().ObjectResourceVersion(obj)

Contributor:

Ok, seems like a small improvement, but we already use a resource version in the eviction REST, so you would still get an update conflict on a different rv.

Contributor Author (@mwielgus):

The eviction REST is per pod. When the pods are different there is no conflict.

Contributor:

I wasn't talking about pods but about the PDB. Even if AllowUnconditionalUpdate returns true, and the eviction REST tries to update the status of a PDB with an rv older than what etcd has, you will get an update conflict.

Contributor:

Note that I am fine with this change but it doesn't solve the issue that is solved by Generation/ObservedGeneration.

@mwielgus (author) commented Nov 8, 2016

Test fixed.

@mwielgus (author) commented Nov 8, 2016

@Kargakis Are you ok with merging this PR?

@0xmichalis (Contributor):

@Kargakis Are you ok with merging this PR?

In its current form, what you have is not going to work. There are two issues at play: 1) you need the eviction REST to increment pdb.Generation every time it updates the status of the PDB, but 2) Generation is supposed to be incremented ONLY on spec changes, so you essentially change its meaning. I am proposing a separate set of fields (EvictionGeneration / EvictionObservedGeneration) that will be specific to the Eviction update. I think @bgrant0607 has proposed somewhere that every component that updates an object and wants to observe its update should have its own generation/observedGeneration pair.

@0xmichalis (Contributor):

Ref #7328 (comment)

@mwielgus (author) commented Nov 8, 2016

Honestly, I don't understand why exactly it won't work. The object version (along with the policy) should guarantee that no two PDB status updates occur at the same time, because the second one would have an incorrect version. This is a compare-and-swap type of operation.

Any extra field/pair of fields that is required to be handled by both the API server and the controller on every update would basically mean a full round trip from the API server to the controller and back, and would make PDB really slow.

@0xmichalis (Contributor):

should guarantee that no two pdb status updates occur at the same time

This should already be fixed in master (the eviction REST is not setting rv to zero, so it never does an unconditional update for a PDB). You still don't guarantee that what the eviction REST sees at a given point in time for a PDB (specifically pdb.status.PodDisruptionsAllowed) is true, in other words observed by the PDB controller.

@mwielgus (author) commented Nov 8, 2016

Yes, there is no guarantee that what the eviction handler sees is true, because a change could have occurred just a second ago and not yet been noticed by the PDB controller. The question is whether, assuming the controller sets the status correctly, the eviction handlers behave properly - in the sense that no unallowed disruption is made. I think that this PR makes that possible.

@0xmichalis (Contributor):

Yes, there is no guarantee that what the eviction handler sees is true, because a change could have occurred just a second ago and not yet been noticed by the PDB controller.

Assuming all pod deletions go through eviction, generation/observedGeneration guarantees that what eviction sees is true. If a change (meaning eviction) occurs, pdb.{status,spec}.evictionGeneration is incremented and the next eviction will need to wait for pdb.status.observedEvictionGeneration to match the new pdb.status.evictionGeneration. I can't think of any other way to synchronize the API server and the PDB controller.

The question is whether, assuming the controller sets the status correctly, the eviction handlers behave properly - in the sense that no unallowed disruption is made. I think that this PR makes that possible.

I am not sure of the additional value at this point.

  1. The Eviction subresource doesn't update unconditionally anyway.
  2. pdb.Generation will always be 1 as the code stands. The pdb.Generation != pdb.Status.ObservedGeneration check will always be false once the controller notices the PDB for the first time.

@mwielgus (author) commented Nov 8, 2016

@Kargakis When I'm talking about the version, I'm not referring to generation or observed generation but to the resource version that is inside every object and which is increased by etcd on every update that happens to the object.
So the scenario is like this:

  1. You create a new pdb. PDB is in ResourceVersion 1.
  2. PDB Controller checks the pods, updates pdb, ETCD checks if version is still 1 and sets ResourceVersion to 2.
  3. The client tries to evict pod X. The Eviction handler gets the current version of PDB (2) checks if the eviction is allowed, updates allowedDisruptions, DisruptedPodMap and updates PDB. ETCD checks if the current ResourceVersion is 2 and updates ResourceVersion to 3.
  4. PDB Controller gets notification about PDB change, checks the pods, updates the status and ETCD bumps ResourceVersion to 4.
  5. Two parallel evictions occur.
    5A) First gets PDB in ResourceVersion 4. Checks whether the eviction is possible. It is allowed so the handler tries to update ETCD.
    5B) Second gets PDB in ResourceVersion 4. Checks whether the eviction is possible. It is allowed so the handler tries to update ETCD.

    6A) ETCD receives the update. Checks that the current version is 4. Accepts the update and bumps the version to 5. The corresponding pod is deleted.
    6B) ETCD receives the update. The update is for version 4. The current version is 5. The update is rejected. A retry is needed. The pod is NOT deleted.

    7B) The handler gets the PDB in version 5. Checks whether the eviction is possible. It is not. The negative response is returned to the client.
    8B) After a while the PDB controller gets the PDB and updates the number of allowed disruptions.

ObservedGeneration is only needed to ensure that the new/updated spec was noticed by the controller.
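
On the client side of step 7B, a caller that is told the eviction is not currently allowed simply waits and retries. A sketch using today's client-go eviction helper (the policy/v1 Evictions call and the 429 TooManyRequests check are assumptions based on the modern API, not what kubectl drain used in 2016):

```go
package main

import (
	"context"
	"time"

	policyv1 "k8s.io/api/policy/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	eviction := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: "my-pod", Namespace: "default"},
	}
	// The API server answers 429 (TooManyRequests) when the budget does not
	// currently allow a disruption, so the client waits and tries again.
	for {
		err := client.PolicyV1().Evictions("default").Evict(ctx, eviction)
		if err == nil {
			break // eviction accepted, the pod is being deleted
		}
		if apierrors.IsTooManyRequests(err) {
			time.Sleep(5 * time.Second)
			continue
		}
		panic(err) // any other error is fatal here
	}
}
```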

@davidopp (Member) commented Nov 9, 2016

I would suggest having @bgrant0607 or @lavalamp review this. I thought the idea of generation was that the PDB controller should set observed generation == current generation (TBH I don't remember how current generation is represented in general -- is it just the resource version on the controller?) when its status reflects the effects of its spec. So for example it prevents you from using Status.PodDisruptionsAllowed right after you change MinAvailable or Selector, until Status.PodDisruptionsAllowed reflects the new value of MinAvailable or Selector. That doesn't seem to be what you're doing here. But I admit I may be misunderstanding the observed generation stuff.

@mwielgus (author) commented Nov 9, 2016

@davidopp
This is exactly what is being done here. I introduce observed generation to ensure that status was created against the current spec. At the same time I set the registry policy to only accept updates if the version based on which the update was made matches the current version in etcd. That ensures that parallel eviction executions don't use the same allowed disruption.

@0xmichalis (Contributor):

  1. You create a new pdb. PDB is in ResourceVersion 1.
  2. PDB Controller checks the pods, updates pdb, ETCD checks if version is still 1 and sets ResourceVersion to 2.
  3. The client tries to evict pod X. The Eviction handler gets the current version of PDB (2) checks if the eviction is allowed, updates allowedDisruptions, DisruptedPodMap and updates PDB. ETCD checks if the current ResourceVersion is 2 and updates ResourceVersion to 3.
  4. PDB Controller gets notification about PDB change, checks the pods, updates the status and ETCD bumps ResourceVersion to 4.
  5. Two parallel evictions occur.
    5A) First gets PDB in ResourceVersion 4. Checks whether the eviction is possible. It is allowed so the handler tries to update ETCD.
    5B) Second gets PDB in ResourceVersion 4. Checks whether the eviction is possible. It is allowed so the handler tries to update ETCD.
    6A) ETCD receives the update. Checks that the current version is 4. Accepts the update and bumps the version to 5. The corresponding pod is deleted.
    6B) ETCD receives the update. The update is for version 4. The current version is 5. The update is rejected. A retry is needed. The pod is NOT deleted.

This should already happen in master, because in order to do an unconditional update you also need to pass a zero rv, but the eviction REST does a normal (conditional) update.

@0xmichalis (Contributor):

I would suggest having @bgrant0607 or @lavalamp review this. I thought the idea of generation was that the PDB controller should set observed generation == current generation (TBH I don't remember how current generation is represented in general -- is it just the resource version on the controller?) when its status reflects the effects of its spec. So for example it prevents you from using Status.PodDisruptionsAllowed right after you change MinAvailable or Selector, until Status.PodDisruptionsAllowed reflects the new value of MinAvailable or Selector. That doesn't seem to be what you're doing here. But I admit I may be misunderstanding the observed generation stuff.

Yes, this already happens in this PR. It doesn't fix #35324 though, because the eviction REST cannot be sure about PodDisruptionsAllowed at any given point (spec updates to the PDB are not going to happen often anyway). I imagine that requiring every eviction to wait for every prior eviction to be observed by the controller will be really slow. We could potentially allow an amount of deviation (for example, proceed with the eviction if evictionGeneration <= evictionObservedGeneration + minAvailable).
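
A sketch of that proposed relaxation. The evictionGeneration / evictionObservedGeneration fields are hypothetical (they were never added to the API); the inequality is the one suggested above.

```go
package sketch

// Hypothetical bookkeeping for the relaxation proposed above: allow an
// eviction as long as the controller's view is not lagging by more than the
// proposed bound.
type pdbSync struct {
	EvictionGeneration         int64 // hypothetical: bumped by the eviction REST
	EvictionObservedGeneration int64 // hypothetical: set by the PDB controller
	MinAvailable               int64
}

func evictionAllowed(s pdbSync) bool {
	return s.EvictionGeneration <= s.EvictionObservedGeneration+s.MinAvailable
}
```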

@mwielgus (author) commented Nov 9, 2016

Could you please describe a scenario when PodDisruptionsAllowed is incorrect?

@0xmichalis (Contributor):

Unless all deletes go through the eviction REST, PodDisruptionsAllowed may easily end up out of sync, because the PDB controller is using caches and it may lag behind (even a lot, in some corner cases).

Example:

  1. A PDB selects 10 pods owned by rs/foo, minAvailable is 5, PodDisruptionsAllowed is 5.
  2. rs/foo is scaled down to 5; PodDisruptionsAllowed will stay at 5 and the eviction REST will be able to evict pods until the PDB controller observes the rs update.

@mwielgus (author) commented Nov 9, 2016

PodDisruptionsAllowed will not stay at 5 because a PDB update is also triggered on pod status change.
See: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/disruption/disruption.go#L117
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/disruption/disruption.go#L356

Yes, for a brief moment PDB status will not be accurate but you have no way to prevent it.

@0xmichalis (Contributor):

PodDisruptionsAllowed will not stay at 5 because a PDB update is also triggered on pod status change.

That's the part where I am talking about the caches the PDB controller is using. The replica set and pod caches are going to have a ton of objects, so eventually they will start lagging.

What is the plan with the eviction API? I know it has already been added in kubectl drain. Is there any plan to switch controllers like the replica set controller to use it too?

@mwielgus (author) commented Nov 9, 2016

I guess that eventually everything will go through the eviction handler. Until then there is absolutely no possibility for the PDB controller to have a fully current status, no matter how many extra fields we add there.

Right now we try to guarantee that if the eviction handler is not bypassed (direct pod deletions, node crashes, etc.) then the PDB works as intended.

@0xmichalis (Contributor):

Ok, observedGeneration/generation changes in this PR LGTM.

@mwielgus (author) commented Nov 9, 2016

Applying the LGTM label as LGTM was given by both @wojtek-t and @Kargakis.

@davidopp I will ask @lavalamp to take a look at this PR as soon as he gets back to the office next week. We will do a follow-up fix if he finds something suspicious.

@mwielgus added the lgtm and priority/backlog labels on Nov 9, 2016
@k8s-github-robot:

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot:

Automatic merge from submit-queue

@k8s-github-robot merged commit 5d4d596 into kubernetes:master on Nov 9, 2016
k8s-ci-robot pushed a commit that referenced this pull request Oct 9, 2019
Typo from 8e2347370e (Add observedGeneration to PodDisruptionBudgetStatus, 2016-11-08, #36438).
k8s-publishing-bot pushed a commit to kubernetes/api that referenced this pull request Oct 9, 2019
Typo from kubernetes/kubernetes@8e2347370e (Add observedGeneration to PodDisruptionBudgetStatus, 2016-11-08, kubernetes/kubernetes#36438).

Kubernetes-commit: a05cabb3a62c18c9cff9ae8c0b77f2f0e57a5ea7
k8s-publishing-bot pushed a commit to kubernetes/kubectl that referenced this pull request Oct 9, 2019
Typo from kubernetes/kubernetes@8e2347370e (Add observedGeneration to PodDisruptionBudgetStatus, 2016-11-08, kubernetes/kubernetes#36438).

Kubernetes-commit: a05cabb3a62c18c9cff9ae8c0b77f2f0e57a5ea7
ohsewon pushed a commit to ohsewon/kubernetes that referenced this pull request Oct 16, 2019
Typo from kubernetes/kubernetes@8e2347370e (Add observedGeneration to PodDisruptionBudgetStatus, 2016-11-08, kubernetes#36438).
Labels: kind/api-change, lgtm, priority/backlog, release-note-none, size/XXL
7 participants